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Abstract 

This paper proposes a novel similarity measure for clustering sequential data. 
We first construct a common state-space by training a single probabilistic 
model with all the sequences in order to get a unified representation for the 
dataset. Then, distances are obtained attending to the transition matrices 
induced by each sequence in that state-space. This approach solves some 
of the usual overfitting and scalabihty issues of the existing semi-parametric 
techniques, that rely on training a model for each sequence. Empirical studies 
on both synthetic and real-world datasets illustrate the advantages of the 
proposed similarity measure for clustering sequences. 

Keywords: Sequential data. Clustering 
1. Introduction 

Clustering is a a core task in machine learning and data processing. It is 
an unsupervised technique whose goal is to unveil a natural partition of data 
into a number of groups or clusters. Clustering techniques have been widely 
studied and applied for a long time, yielding a large number of well-known 
and efficient algorithms [1]. Recently, the family of algorithms collectively 
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known as spectral clustering [2, 3], that connect the clustering problem with 
the kernel methods, have received a lot of attention due to their good perfor- 
mance. They share a graph-theoretic approach that results in non-parametric 
partitions of a dataset, in the sense that they do not require any parametric 
model of the input data. 

The application of clustering techniques to sequential data is lately re- 
ceiving growing attention [4]. In these scenarios, the useful information is not 
encoded only in the data vectors themselves, but also in the way they evolve 
along a certain dimension (usually time). Examples of sequential data range 
from stock market analysis to audio signals, video sequences, etc. Developing 
machine learning techniques for these scenarios poses additional difficulties 
compared to the classical setting where data vectors are assumed to be inde- 
pendent and identically distributed (i.i.d.). The common framework consists 
in combining a distance or similarity measure between sequences of variable 
length, which captures their dynamical behavior, with a clustering algorithm 
developed for i.i.d. data. 

The design of similarity measures for sequences is generally addressed 
from a model-based perspective. Specifically, hidden Markov models (HMMs) 
[5] are usually employed as models for the sequences in the dataset. HMMs 
have been widely used in signal processing and pattern recognition because 
they oofer a good trade-off between complexity and expressive power. Based 
on the work of Smyth [6], many researchers [7, 8, 9, 10] have proposed dif- 
ferent distance measures based on a likelihood matrix obtained using each 
single sequence to train an HMM. Besides, [11] defines the similarity be- 
tween two sequences as the probabihty product kernel (PPK) between the 
HMMs trained on each sequence. The apphcation of a non-parametric clus- 
tering to the distance matrix obtained in one of the aforementioned ways 
yields a semi-parametric model: each individual sequence follows 
metric model, but no assumption is made about the global cluster structure. 
These semi-parametric methods have been shown [10, 11] to outperform both 
fully parametric methods such as mixture of HMMs [12] or combinations of 
HMMs and dynamic time warping [13]. 

This paper aims to solve a main weakness of the aforementioned semi- 
parametric models. The fact that each model is trained using just one se- 
quence can lead to severe overfitting or non-representative models for short 
or noisy sequences. In addition, the learning of a semi-parametric model 
involves the calculation of a number of likelihoods or probability product 
kernels that is quadratic in the number of sequences, which hinders the scal- 
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ability of the method. To overcome these disadvantages, wc propose to train 
a single HMM using all the sequences in the dataset, and then cluster the 
sequences attending to the transition matrices they induce in the state-space 
of the common HMM. 

The rest of the paper is organized as follows: Section 2 starts with a brief 
review of hidden Markov models and spectral clustering followed by a presen- 
tation of the state of the art in model-based clustering of sequences. Section 
3 describes how to cluster sequences using information about their dynam- 
ics in a common state-space. This new proposal is empirically evaluated in 
comparison with other methods in Section 4. Finally, Section 5 draws the 
main conclusions of this work and sketches some promising lines for future 
research. 

2. Clustering Sequences with Hidden Markov Models 

This section reviews the state-of-the-art framework employed to cluster 
sequential data, which consists of two phases: (1) The design of a similarity 
or distance measure for sequences based on Hidden Markov Models; and 
(2) the use of that measure in a clustering algorithm. We opt for spectral 
clustering due to its good results reported in the literature. Nonetheless, 
the distances presented in this work can be used in combination with any 
clustering algorithm. 

2.1. Hidden Markov Models 

Hidden Markov models (HMMs) [5] are a type of parametric, discrete 
state-space model. They provide a convenient model for many real-life phe- 
nomena, while allowing for low- complexity algorithms for inference and learn- 
ing. Their main assumptions are the independence of the observations given 
the hidden states and that these states follow a Markov chain. 

Assume a sequence S of T observation vectors S = {xi,...,Xr}. The 
HMM assumes that Xj, the t*^ observation of the sequence, is generated ac- 
cording to the conditional emission density p(xj|gj), with being the hidden 
state at time t. The state qt can take values from a discrete set {si, . . . , sk} 
of size K. The hidden states evolve following a time-homogeneous first-order 
Markov chain, so that p{qt\qt-i, qt-2, ■ ■ ■,qo) = p{qt\qt-i)- 

In this manner, the parameter set 6 that defines an HMM consists of the 
following distributions: 

• The initial probabilities vector tt = {tTj}^^, where tTj — p{qo — Sj). 
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• The state transition probability, encoded in a matrix A = {flij}jj=i 
with Qij = p{qt+i = Sj\qt = Sj), 1 <i,j < K. 

• The emission pdf for each hidden state pi'X-tlQt — Si), I < i < K. 

From these definitions, the hkehhood of a sequence S = {xi, . . . , x^} can 
be written in the following factorized way: 

T 

= J2 Vi^o\qo)Ylpi^t\qt)aq,_,,q,. (1) 

qo,-,qT t=l 

The training of this kind of models in a maximum likelihood setting is 
usually accomplished using the Baum- Welch method [5], which is a partic- 
ularization of the well-known EM algorithm. The E-step finds the expected 
state occupancy and transition probabilities, which can be done efficiently 
using the forward-backward algorithm [5]. This algorithm implies the calcu- 
lation of both the forward a and backward /3 variables that are defined as 
follows: 

afc(t) = p(xi,...,xt,5t = (2) 
l^kit) = P(xt+i, . . . , xrlqt = Sk). (3) 

These variables can be obtained in 0{K'^T) time through a recursive proce- 
dure and can be used to rewrite the hkelihood from Eq. (1) in the following 
manner: 

K 

p(S|e)= (4) 

k=l 

which holds for all values oftG{l,...,T}. 

Given a previously estimated A, the state transition probabilities can be 
updated using the forward/backward variables and that previous estimation, 
yielding: 

T 

ttij (X ai{t')aijp{xt'+i\qt'+i = Sj)Pj{t' + 1). (5) 
t'=i 

Then, the M-step updates the parameters in order to maximize the likehhood 
given the expected hidden states sequence. These two steps are then iterated 
until convergence. It is worth noting that the likelihood function can have 
many local maxima, and this algorithm does not guarantee convergence to 
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the global optimum. Due to this, it is common practice to repeat the training 
several times using different initializations and then select as the correct run 
the one providing a larger likelihood. 

The extension of this training procedure to multiple input sequences is 
straightforward. The interested reader is referred to [5] for a complete de- 
scription. 

2.2. Spectral Clustering 

Clustering [1] consists in partitioning a dataset S comprised of N elements 
into C disjoint groups called clusters. Data assigned to the same cluster must 
be similar and, at the same time, distinct from data assigned to the rest of 
clusters. It is an unsupervised learning problem, meaning that it does not 
require any prior labeling of the data, and thus it is very appropriate for 
exploratory data analysis or scenarios where obtaining such a labeling is 
costly. 

Algebraically, a clustering problem can be formulated in the following 
way. Given a dataset S, one forms a. N x N similarity matrix W, whose ij*'* 
element Wij represents the similarity between the i*^ and j*^ instances. The 
clustering problem then consists in obtaining a. N x C clustering matrix Z, 
where Zic = 1 if instance i belongs to cluster c and Zic — otherwise, which 
is optimal under some criterium. 

Spectral clustering (SC) algorithms [3] approach the clustering task from 
a graph-theoretic perspective. Data instances form the nodes V of a weighted 
graph G = (V, E) whose edges E represent the similarity or adjacency be- 
tween data, defined by the matrix W. This way, the clustering problem is 
cast into a graph partitioning one. The clusters are given by the partition 
of G in C groups that optimize certain criteria such as the normalized cut 
[14]. Finding such an optimal partition is an NP-hard problem, but it can 
be relaxed into an eigenvalue problem on the Laplacian matrix L = D — W, 
where D is the diagonal matrix with elements da = Yl!j=i''^iji or one of 
its normalized versions, followed by /c-means or any other clustering algo- 
rithm on the rows of the matrix of selected eigenvectors. Specifically, in the 
experiments in this paper, we employ the spectral clustering algorithm de- 
fined in [2], which uses the symmetric version of the normalized Laplacian 
Lsym = D~^/^LD~^/^. The actual clustering is applied to the normahzed 
rows of the C eigenvectors associated with the lowest eigenvalues. It is worth 
noting that the embedding of the original data into the rows of the normal- 
ized Laplacian eigenvectors also have some appealing properties from the 
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point of view of dimensionality reduction [15]. 

The time complexity for the spectral clustering is dominated by the eigen- 
decomposition of the normalized Laplacian, which in general is 0{N^). How- 
ever, if the affinity matrix is sparse (e.g. if only the affinities between the 
nearest neighbors of a given node are considered), there exist efficient it- 
erative methods that notably reduce this complexity, such as the Lanczos 
method [16] , which makes it feasible even for large datasets [3] . 

2.3. Semi-Parametric Model-Based Sequence Clustering 

Semi-parametric sequence clustering methods make no assumptions about 
the cluster structure. These methods typically use generative models such as 
HMMs in order to represent the sequences in the dataset in a common space, 
and then apply a non-parametric clustering algorithm. Following [6], many 
methods use a log-likelihood matrix L whose ij*^ element is defined as 

= logp,,- = ^^^g^^g^.^) logp(S,|^i), l<i,j <N, (6) 

where 9i is the model trained for the i*'* sequence Sj and N is the number of 
sequences in the dataset. The log-likelihoods are normalized with respect to 
the length of the sequences to compensate the exponentially increasing size 
of the probability space. 

The literature reports several methods to construct a similarity (distance) 
matrix D from L, such as: 

• The "symmetrized distance" (SYM) [6]: dfj^ = \{lij + lji). 

• The BP distance in [8]: df/ = \ {^^^ + ^^1^}' ^^^^^ ^^^^^ 
account how well a model represents the sequence it has been trained 
on. 

• The Yin- Yang distance of [10] dJJ = + Ijj — Lij — lji\, 

The distance matrix D is then fed into a clustering algorithm that partitions 
the set of sequences. It is worth noting that all these methods define the 
distance between two sequences Sj and Sj using solely the models trained on 
these particular sequences {9i and 9j). 

In [11] another method for constructing the similarity matrix in a model- 
based approach is proposed which avoids the calculation of the likelihood 
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matrix L. Again an HMM is trained on each individual sequence, but then 
similarities between sequences are computed directly through a probability 
product kernel (PPK). Specifically, this kernel is obtained by integrating a 
product of the probability distributions spanned by two HMMs in the space 
of the sequences of a fixed length TppK, which is a free parameter. This way, 
L is no longer necessary because the similarities are obtained directly from 
parameters of the models. The calculation of the PPK between two HMMs 
can be carried out in 0{K^Tppy^) time. 

Furthermore, the method proposed in [7] assumes the existence of a latent 
model space 6 formed by some HMMs that actually span the data space. 
Then, the models 9i,. . . ,9^ trained on each sequence are regarded as a set 6 
of intelligently sampled points of 0. Let L^v be a column-wise normalisation 
of L. The i*'* column of L^v can be interpreted as a probability density 
function /|^(^) over the approximated model space 6 for Sj. This way it is 
possible to re-express Lat as: 



This interpretation leads to a distance measure consisting in KuUback-Leibler 
(KL) divergences between the columns of L^v, so: 



where -Dklsym stands for the symmetrized version of the KL divergence: 
-^KLsYMblk) = H^KL(]?||g) + ^KL(g||p))- Thls Way, the distance between 
two sequences is obtained in a global way, using a data-dependent repre- 
sentation. Under this paradigm, it is possible to select a subset of P < N 
sequences to train individual models on, instead of fitting an HMM to every 
sequence in the dataset. This can reduce the computational load and im- 
prove the performance in real-world data. However, the a priori selection of 
such a subset remains an open problem. 

The aforementioned semi-parametric methods share the need to train an 
individual HMM on each sequence in the dataset (or in a subset of it, as in 
[7]). We see this as a disadvantage for several reasons. Short sequences are 
hkely to result in overfitted models, giving unreahstic results when used to 
evaluate likelihoods or PPKs. Moreover, training individual models in an 
isolated manner prevents the use of similar sequences to get more accurate 
representations of the states. As for the computational complexity, these 
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methods do not scale well with the dataset size A^. Specifically, the number 
of likelihoods that need to be obtained is N"^ (or PN using the KL method). 
In the case of PPKs, N"^ /2 evaluations are required, since the kernel is sym- 
metric. 

3. State Space Dynamics (SSD) Distance 

In this paper, we propose to take a different approach in order to overcome 
the need to fit an HMM to each sequence. To this end, we propose to train 
a single, large HMM 9 of K hidden states using all the sequences in the 
dataset. This will allow for a better estimation of the emission probabilities 
of the hidden states, compared to the case where an HMM is trained on each 
sequence. Then, we use the state-space of 6' as a common representation 
for the sequences. Each sequence S„ is linked to the common state-space 
through the transition matrix that it induces when is fed into the model. 
This matrix is denoted as A" — \afA . . where 

<- J 1,3=1 

~a^,=piq^+, = Sj\q^ = Si,Sn,9). (8) 

In order to obtain each A", we run the forward-backward algorithm for 
sequence S„ under the parameters 9 (including the learned transition matrix 
A = {dij}) and then obtain the sequence-specific transition probabilities by 
using equation (5): 

T 

oc J2 ant')aijPi^t'+i\qt'^i = sj)PJit' + 1), (9) 
t'=i 

where Q;"(t) and /3'j{t' + 1) are the forward and backward variables for S„, 
respectively. This process can be seen as a projection of the dynamical 
characteristics of S„ onto the state-space defined by the common model 9. 
Therefore, the overall transition matrix A of the large model 9 act as a com- 
mon, data-dependent "prior" for the estimation of these individual transition 
matrices. 

This procedure is somewhat equivalent to obtaining individual HMMs 
with emission distributions that are shared or "clamped" amongst the differ- 
ent models. Clamping is a usual and useful tool when one wants to reduce 
the number of free parameters of a model in order to either obtain a better 
estimate or reduce the computational load. In our case, the effects of clamp- 
ing the emission distributions are two-fold: we get the usual benefit of better 
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estimated parameters and, at the same time, it allows for simple distance 
measures between hidden Markov models using the transition distributions. 
This happens because the transition processes of the different models now 
share a common support, namely the fixed set of emission distributions. 

As previously mentioned, running the forward-backward algorithm im- 
plies a time complexity of 0{K'^T) for a sequence of length T, which is 
the same complexity required for obtaining the likelihood of an HMM. Our 
proposal only requires N of these calculations, instead of N'^ likelihood eval- 
uations or PPKs as in the methods mentioned in the previous section. 
This makes the SSD algorithm a valuable method for working with large 
datasets. 

At this point, we have each sequence S„ represented by its induced tran- 
sition matrix A". In order to define a meaningful distance measure between 
these matrices, we can think of each A" = [a„i, . . . ,a„/^]"^ as a collection 
of K discrete probability functions cini, . . . , a.nK, one per row, corresponding 
with the transition probabilities from each state to every other state. In 
this manner, the problem of determining the affinity between sequences can 
finally be transformed into the well-studied problem of measuring similarity 
between distributions. In this work, we employ the Bhattacharyya affinity 
[17], defined as: 

-Db(Pi,P2) = VPi(^)P2ix), (10) 

X 

where pi and p2 arc discrete probability distributions. We consider the affin- 
ity between two transition matrices to be the mean affinity between their 
rows. The distance between two sequences Sj and Sj can then be written as: 

C = -log-^f]i?i.fe.,P,.). (11) 
fc=i 

Other approaches could be used in order to define distances between 
the different transition matrices. For example, instead of using A" directly, 
an idea similar to diffusion distances [18] can be applied by using different 

powers of the transition matrices ^A'^^ , where t is a time index. This is 

equivalent to iterating the random walk defined by the transition matrices 
for t time steps. The j*'* row of such an iterated transition matrix encodes 
the probabilities of transitioning from state j to each other state in t time 
steps. However, this introduces the extra parameter t, which must be set 
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very carefully. For example, many transition matrices converge very quickly 
to the stationary distribution even for low t (specially if the number of states 
is small). This can be a problem in cases where the stationary distributions 
for sequences in different clusters are the same. An example of such a scenario 
is presented in Section 4. 

Moreover, the SSD distance measure is very flexible. Measuring distances 
between sequences is highly subjective and application dependant. For ex- 
ample, in a certain scenario we may not be interested in the rest time for 
each state, but only in the transitions (similar to Dynamic Time Warping 
[19]). To this end, a good alternative would be to obtain the transition ma- 
trices A" for every sequence, but ignore the self transitions in the distance 
measurement. That can be easily done by setting all the self-transitions to 
and then renormalizing the rows of the resulting transition matrices. 

Once the distances between all the sequences are obtained, the actual 
clustering can be carried out using spectral clustering (or any other typi- 
cal technique). We refer to this algorithm as state-space dynamics (SSD) 
clustering. It is summarized in Alg. 1. 

It is worth noting that our proposal does not include any special ini- 
tialization of the large model representing the dataset, such as imposing a 
block-diagonal structure on the transition matrix to encourage the cluster- 
ing [6] . We do not aim to obtain a single generative model of the complete 
dataset, but an adequate common representation that allows for a subsequent 
successful non-parametric clustering. 

An important free parameter of our method is the number of hidden states 
of the common model. It should be chosen accordingly to the richness and 
complexity of the dataset. In the worst case (that is to say, assuming that 
there is no state sharing amongst different groups), it should grow linearly 
with the number of groups. In this work, we have fixed this size a priori, but 
it could be estimated using well-known criteria such as BIG or AIC [20] . Re- 
member that the forward-backward algorithm for HMMs is 0{K'^T), where 
K is the number of states and T the sequence length. This indicates that 
our proposal is specially suitable for datasets consisting of a large number of 
sequences coming from a small number of clusters, which is a usual case. In 
such a scenario, the number of hidden states required for a successful cluster- 
ing is low, so the time penalty in the FW-BW algorithm will be more than 
compensated by the significant computational load reduction coming from 
the linear complexity in the number of sequences. If sequences coming from 
different clusters share some emission distributions, the improvements will 
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Algorithm 1 SSD distance for clustering sequential data 

Inputs: 

Dataset S = {Si, . . . , Sjv}, sequences 
K: Number of hidden states 

Algorithm: 

Step 1: Learning the global model (Baum Welch) 
9 = argmaxg)/ P(Si, . . . , SAr|6'') 

Step 2: Estimating A" = {(^fj} (Forward/Backward) 
for all S„ do 

a,{t) = P{Sn{l), . . . ,Sn{t),qt = m 
Pkit) = PiSnit + 1), . . . , S„(T„), qt = m 
^ij oc aiit)aijp{xt+i\qt+i = i)l3j{t + 1) 

end for 

Step 3: Obtaining the distance matrix D = {dij} 
for all i,j do 

Pifc = k^^ row of A* 

dij = - log ;^ E^fc'=l VPik{k')Pjk{k') 
end for 

Step 4: Clustering using D 



be even more notorious, because the algorithm will exploit that sharing in a 

natural way. 

Finally, our work can be seen as similar to [21]. There, the authors 
propose a bayesian clustering method based on transition matrices of Markov 
chains. They assume that the sequences are discrete, so a Markov chain can 
be directly estimated via transition counts. Our proposal, on the other hand, 
uses Markov chains on the latent variables (states), what makes it far more 
general. Moreover, our focus is on defining a general model-based distance 
between sequences, so that the SSD distance can be directly coupled with a 
wide range of clustering algorithms depending on the task at hand. 
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4. Experimental Results 

In this section we present a thorough experimental comparison between 
SSD and state of the art algorithms using both synthetic and real- world data. 
Synthetic data include an ideal scenario where the sequences in the dataset 
arc actually generated using HMMs, as well as a control chart clustering 
task. Real data experiments include different scenarios (character, gesture 
and speaker clustering) selected from the UCI-ML [22] and UCI-KDD [23] 
repositories. The implementation of the compared algorithms is provided in 
the author's website^, except for PPK, which is available in Prof. Jebara's 
website^. 

The compared methods for obtaining the distance matrix are: SSD, 
state-space dynamics clustering with Bhattacharyya distance; PPK, Proba- 
bility Product Kernels [11]; KL, KL-divergence based distance [7]; BP, BP 
metric [8]; YY, Yin- Yang distance [10] and SYM, Symmetrized distance [6]. 

We denote the number of hidden states of the global model used by SSD 
as K, and the number of states per model of the methods that rely on training 
a HMM on each single sequence as Km- 

Once a distance matrix is available, we perform the actual clustering using 
the spectral algorithm described in [2]. The different distance matrices are 
turned into similarity matrices by means of a Gaussian kernel whose width is 
automatically selected in each case attending to the cigengap. Though more 
elaborated methods such as [24] can be used to select the kernel width, in our 
experiments it is automatically selected in each case attending to the eigengap 
since the experimental results are good enough. We assume that the number 
of clusters is known a priori. If this is not the case, automatic determination 
of the number of clusters can be carried out by methods such as those in 
[24, 25]. The PPK method directly returns a similarity matrix, that is first 
converted into a distance matrix by taking the negative logarithm of each 
element. Then, it is fed into the clustering algorithm with automatic kernel 
width selection. The final A;-means step of the spectral clustering algorithm 
is run 10 times, choosing as the final partition the one with the minimum 
intra-cluster distortion. The free parameter T of the PPK method is fixed 
to 10 following [11]. 

The results shown in the sequel are averaged over a number of iterations 



^h.ttp : / / www . tsc . uc3m . es/~dggarcia 

^http : / /wwwl . cs . Columbia. edu/~j ebara/ code .html 
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in order to account for the variability coming from the EM-based training 
of the HMM. Performance is measured in the form of clustering accuracy, 
understood as the percentage of correctly classified samples under an optimal 
permutation of the cluster labels, or its reciprocal, the clustering error. 

4.I. Synthetic data 

In this subsection we test the algorithms using two kinds of synthetically 
generated data: a mixture-of-HMMs (MoHMM) scenario as in [6, 7], and a 
UCI-ML dataset representing control charts. 

4.1.1. Mixture of HMMs 

Each sequence in this dataset is generated by a mixture of two equiprob- 
able HMMs 9i and 62. Each of these models has two hidden states, with an 
uniform initial distribution, and their corresponding transition matrices are 



Emission probabihties are the same in both models, specifically A^(0, 1) in the 
first state and A'^(3, 1) in the second. This is a deceptively simple scenario. 
Since both the emission probabilities and the equilibrium distributions are 
identical for both models, the only way to differentiate sequences generated 
by each of them is to attend to their dynamical characteristics. These, in 
turn, are very similar, making this a hard clustering task. The length of 
each individual sequence is uniformly distributed in the range [0.6//^, I.A/il], 
where /i^ is the mean length. 

Figure 1 shows the clustering error achieved by the compared algorithms 
in a dataset oi N — 100 sequences, averaged over 50 runs. All the algo- 
rithms use a correct model structure {K^ — 2 hidden states per class) to fit 
each sequence. For SSD, this implies using 4 hidden states for the common 
model {K = 4). As expected, when the sequences follow an HMM generative 
model and the representative model structure is chosen accordingly, SSD 
achieves impressive performance improvements for short sequence lengths. 
In contrast, algorithms that rely on training an HMM for each sequence suf- 
fer from poor model estimation when the mean sequence length is very low 
(< 100), which in turn produces bad clustering results. Our proposal over- 
comes this difficulty by using information from all the sequences in order 
to generate the common representative HMM. Consequently, the emission 
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probabilities are estimated much more accurately and the distances obtained 
are more meaningful, leading to a correct clustering. Nonetheless, when the 
sequences are long (> 200) very accurate models can be obtained from each 
single sequence and the different methods tend to converge in performance. 




Mean sequence length 

Figure 1: Clustering error for the MoHMM case 

4- 1-2. Synthetic Control Chart 

This dataset contains unidimensional time series representing six different 
classes of control charts: normal, cyclic, increasing trend, decreasing trend, 
upward shift and downward shift. There are 100 instances of each class, with 
a fixed length of 60 samples per instance. A sample of each class is plotted 
in Fig. 2. 

We carry out a multi-class clustering task on this dataset, partitioning 
the corpus into 6 groups which we expect to correspond with the different 
classes of control charts. As explained in Sec. 3, the size of the state-space 
for the HMM in SSD clustering should be chosen accordingly to the number 
of classes, so we employ a number of hidden states in the range 12-28. It 
also allows us to show that our proposal is robust enough to produce good 
performance in such an extense range. Results, averaged over 10 runs, are 
shown in Table 1. It should be pointed out that the methods SYM, YY, KL, 
BP, PPK could not handle the complete dataset of 100 instances per class in 
reasonable time, and had to be tested on a set of 30 sequences per class. 
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Figure 2: Some samples from the Synthetic Control Chart dataset 

SSD clearly outperforms the compared algorithms. It is also remarkable 
that these results confirm that our proposal benefits notably from larger 
datasets, as can be seen by comparing the performances for 30 and 100 
sequences per class. This is due to the fact that, in contrast to previous 
proposals, the modeling employed by SSD clustering improves as the dataset 
size increases. The confusion matrix when = 30 and K = 20 (averaged 
over the 10 runs) is shown in Fig. 3 in the form of a Hinton diagram. 




Normal Cyclic Inc. Trend Dec. Trend Up. Shift Down. Shift 



Figure 3: Confusion matrix for SSD clustering with 30 sequences per class and 20 hidden 

states 
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4-2. Real-world data clustering experiments 

We use the following datasets from the UCI ML and KDD archives: 

Character Trajectories: This dataset consists of trajectories captured by 
a digitizing tablet when writing 20 different characters. Each sample is 
a 3-dimensional vector containing the x and y coordinates as well as the 
pen tip force. The sequences are already differentiated and smoothed 
using a Gaussian kernel. We use 25 sequences per character, and carry 
out two-class clustering between all the possible combinations, giving 
a total of 190 experiments. The average length of the sequences in this 
dataset is around 170 samples. 

AUSL AN: The Australian Sign Language dataset is comprised of 22-dimens- 
ional time series representing different sign-language gestures. The ges- 
tures belong to a single signer, and were collected in different sessions 
over a period of nine weeks. There are 27 instances per gesture, with 
an average length of 57 samples. Following [11], we perform 2-class 
clustering tasks using semantically related concepts. These concepts 
are assumed to be represented by similar gestures and thus provide a 
difficult scenario. 

Japanese Vowels: To construct this dataset, nine male speakers uttered 
two Japanese vowels consecutively. The actual data is comprised of 
the 12-dimensional time-series of LPC cepstrum coefficients for each 
utterance, captured at a sampling rate of lOKHz using a sliding window 
of 25.6ms with a 6.4ms shift. The number of samples per sequence 
varies in the range 7-29 and there are 30 sequences per user. We use 
this dataset for two different tasks: speaker clustering and speaker 
segmentation. 

Table 2 shows the numerical results, averaged over 10 runs. In the Char- 
acter Trajectories dataset, the individual sequences are fairly long and the 
classes are mostly well separated, so this is an easy task. Single sequences 
are informative enough to produce good representative models and, conse- 
quently, most methods achieve very low error rates. Nonetheless, using the 
SSD distance outperforms the competitors. 

For the AUSLAN dataset, following [11], we used HMMs with = 2 
hidden states for the methods that train a single model per sequence. The 
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sequences were fed directly to the different algorithms without any prepro- 
cessing. We reproduce the results for the PPK method from [11]. The com- 
mon model for SSD employs K = 4 hidden states {2Km), since the 2- way 
clustering tasks are fairly simple in this case. It is worth noting that the bad 
performance in the 'Yes' vs 'No' case is due to the fact that the algorithms 
try to cluster the sequences attending to the recording session instead of to 
the actual sign they represent. Our proposal produces great results in this 
dataset, surpassing the rest of the methods in every pairing except for the 
pathological 'Yes' vs 'No' case. 

Finally, we carry out a 9-class speaker clustering task using the Japanese 
Vowels dataset. The large number of classes and their variability demands a 
large number of hidden states in the common HMM of SSD. This, in turn, 
means a time penalty as the HMM training time is quadratic in the number 
of hidden states. Nonetheless, the performance obtained in this dataset by 
our proposal is very competitive in terms of clustering accuracy, only being 
surpassed by the KL method. It is also remarkable how the SSD-based 
clustering exhibits a very stable performance in a huge range of state-space 
cardinalities, what confirms our intuition that an accurate determination of 
that parameter is not crucial to the algorithm. 

4-2.1. Speaker Segmentation 

In order to briefly show another application of sequence clustering, we 
have reproduced the Japanese Vowels Dataset segmentation experiment from 
[26] using our proposed SSD distance. This scenario is constructed by con- 
catenating all the individual sequences in the dataset to form a long sequence 
which we would like to divide into 9 segments (one per user). Only two meth- 
ods, KL and SSD have been tested in this task because KL they were the 
best-performing methods in the clustering task. In order to carry out the 
sequence-clustering-based segmentation, we first extract subsequences using 
a non-overlapping window. The length of these windows range from 10 to 20 
samples. When using the KL distance, each subsequence is modeled using a 
2-state HMM, which is the optimal value for the previously shown clustering 
task. A distance matrix is then obtained from these subsequences using the 
KL or SSD distances, and then spectral segmentation (SS) is carried out 
instead of spectral clustering [26] . This amounts to performing dynamic pro- 
gramming (DP) on the eigenvectors resulting from the decomposition of the 
normalized Laplacian matrix used for spectral clustering. In the ideal case, 
those eigenvectors are expected to be approximately piece-wise constant on 
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each cluster, thus being a very adequate representation to run DP on in order 
to obtain a segmentation of the subsequences. 

Table 3 shows the performance obtained in this task. Segmentation error 
is measured as the number of incorrectly "classified" segments in the se- 
quence, which can be seen as the fraction of the time that the segmentation 
algorithm is giving wrong results. As the results confirm, using the SSD dis- 
tance is very adequate in this scenario because the number of subsequences 
is quite large and, at the same time, all of them are very short. This way, 
we exploit both the reduced time complexity in the number of sequences and 
the better estimation of the emission distributions. 

5. Conclusions 

In this paper we have presented a new distance for model-based sequence 
clustering using state-space models. We learn a single model representing 
the whole dataset and then obtain distances between sequences attending to 
their dynamics in the common state-space that this model provides. It has 
been empirically shown that the proposed approach outperforms the previous 
semi-parametric methods, specially when the mean sequence length is short. 
Furthermore, the proposed method scales much better with the dataset size 
(linearly vs quadratically). As drawback of this method it should be men- 
tioned that, as the number of classes grow, the common model may need a 
large number of hidden states to correctly represent the dataset (although 
the method is empirically shown not to be too sensitive to the accurate deter- 
mination of the model size). In the case of hidden Markov models, the time 
complexity of the training procedure is quadratic in this number of states, 
so total running time can be high in these cases. Consequently, we find our 
proposal specially appealing for scenarios with a large number of sequences 
coming from a few different classes, which is a very usual case. 

Promising lines for future work include the application of this methodol- 
ogy to other state-space models, both discrete and continuous, and to semi- 
supervised scenarios. We are also investigating alternative definitions of dis- 
tance measures between transition matrices in order to take into account the 
potential redundancy of the state-space. 
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Table 1: Mean accuracy (standard deviation in brackets) in the Control Chart dataset. SYM, YY, KL, BP, PPK had to be 
tested with 30 sequences per class due to computational reasons. 



# hidden stat. 



SYM 



YY 



KL 



BP 



PPK 



76.11% (±0.1) 
74.00% (±0.6) 
74.78% (±0.6) 
77.00% (±1.8) 
74.67% (±0.3) 



74.11% (±5.3) 
74.67% (±0.5) 
76.44% (±1.6) 
79.11% (±3.2) 
76.89% (±3.1) 



74.67% (±5.3) 
78.67% (±3.9) 
79.33% (±3.1) 
77.44% (±4.9) 
76.22% (±2.9) 



46.89% (±3.3) 
55.22% (±0.6) 
51.22% (±4.8) 
41.78% (±3.1) 
40.11% (±3.5) 



K =2 
K =4 



78.33% (±0.2) 
75.89% (±3.6) 
76.00% (±4.4) 
79.78% (±1.6) 
74.44% (±2.0) 



# of hidden states 



^=12 
K=1Q 
K=2Q 
K^2S 



SSD (30 seq/class) SSD (100 seq/class) 



81.61% (±6.0) 
86.28% (±6.3) 
87.33% (±4.1) 
88.19% (±4.7) 



91.57% (±1.5) 
92.71% (±1.8) 
93.23% (±1.4) 
94.07% (±1.1) 



Table 2: Performance on the Character Trajectories (top), AUSLAN (middle) and 
Japanese Vowels (bottom, 9-class clustering task) datasets. The standard deviation of 
the results for the AUSLAN dataset is in every case except for 'SPEND' vs 'COST' 
using YY distance, with a value of 0.8. The number of hidden states is ii" = 4 for SSD 
and Km = 2 for the rest of methods (best case) 



# hidden stat. 



SYM 


YY 


KL 


BP 


PPK 


96.10% 


97.02% 


96.42% 


96.57% 


76.72% 


(±0.4) 


(±0.2) 


(±0.1) 


(±0.2) 


(±0.8) 


96.30% 


96.90% 


96.45% 


95.23% 


67.66% 


(±0.1) 


(±0.2) 


(±0.0) 


(±0.3) 


(±0.8) 


95.31% 


95.53% 


95.69% 


83.92% 


62.25% 


(±0.2) 


(±0.0) 


(±0.1) 


(±0.8) 


(±0.2) 


96.28% 


96.40% 


96.58% 


84.70% 


61.39% 


(±0.3) 


(±0.1) 


(±0.1) 


(±0.1) 


(±0.2) 



# hidden stat. 



SSD 



Km = 2 



Km = 3 



Km = 4: 



Km = 5 



K=U 



K=16 



K=20 



K=22 



97.17% 
(±0.3) 
97.58% 
(±0.2) 
98.23% 
(±0.2) 
98.35% 
(±0.2) 



SIGNS 



'HOT' vs 'COLD' 
'EAT' vs 'DRINK' 
'HAPPY' vs 'SAD' 
'SPEND' vs 'COST' 
'YES' vs 'NO' 



SYM 



YY 



KL 



BP 



PPK 



SSD 



100% 100% 100% 100% 100% 

51.85% 92.59% 92.59% 92.59% 93% 

59.26% 98.15% 100% 98.15% 87% 

54.07% 99.63% 100% 100% 80% 

60.36% 55.56% 55.56% 55.56% 59% 



100% 
95.37% 
100% 
100% 

56.66% 



# hidden stat. 


SYM 


YY 


KL 


BP 


PPK 


# hidden stat. 


SSD 




66.67% 


85.11% 


90.15% 


85.30 % 


75.48% 




82.30% 


Km = 2 












K=20 






(±2.8) 


(±2.1) 


(±1.7) 


(±1.0) 


(±3.7) 




(±3.7) 




67.18% 


79.01% 


81.14% 


75.44% 


82.59% 




85.48% 


Km = 3 












K=30 






(±3.0) 


(±4.1) 


(±4.0) 


(±3.7) 


(±5.1) 




(±4.5) 




67.96% 


83.41% 


85.55% 


78.55% 


79.78% 




87.93% 


Km = 4 




K=40 






(±3.4) 


(±6.2) 


(±5.4) 


(±4.8) 


(±3.7) 




(±4.4) 


Km = 5 


70.44% 


82.81% 


82.30% 


79.77% 


78.15% 


K=50 


86.37% 
















(±3.6) 


(±5.2) 


(±5.5) 


(±4.4) 


(±3.9) 




(±6.9) 
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Table 3: Segmentation error (mean and standard deviation) in the Japanese Vowels dataset (9 sources and segments) using 
KL and SSD distances 





KL-SS 




SSD-SS 




Window length 






K^24: X=32 


X=40 


W=10 


1.0% (±0.22) 


1.61% (±0) 


1.0% (±0.29) 0.82% (±0.35) 


0.72% (±0.33) 


W=15 


3.5% (±0) 


2.39% (±0) 


1.12% (±0.39) 0.95% (±0.29) 


0.98% (±0.27) 


W=20 


1.4% (±0) 


2.66% (±0) 


1.26% (±0.44) 1.03% (±0.53) 


0.75% (±0.45) 



